Data-Efficient Contrastive Self-supervised Learning: Easy Examples Contribute the Most
Self-supervised learning (SSL) learns high-quality representations from large
pools of unlabeled training data. As datasets grow larger, it becomes crucial
to identify the examples that contribute the most to learning such
representations. This enables efficient SSL by reducing the volume of data
required for learning high-quality representations. Nevertheless, quantifying
the value of examples for SSL has remained an open question. In this work, we
address this for the first time, by proving that examples that contribute the
most to contrastive SSL are those that have the most similar augmentations to
other examples, in expectation. We provide rigorous guarantees for the
generalization performance of SSL on such subsets. Empirically, we discover,
perhaps surprisingly, the subsets that contribute the most to SSL are those
that contribute the least to supervised learning. Through extensive
experiments, we show that our subsets outperform random subsets by more than 3%
on CIFAR100, CIFAR10, and STL10. Interestingly, we also find that we can safely
exclude 20% of examples from CIFAR100 and 40% from STL10, without affecting
downstream task performance. Comment: Accepted to ICML 202
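A minimal sketch of the selection idea described in this abstract, under our own assumptions: score each example by the expected cosine similarity of its augmentations to the augmentations of other examples (estimated with a few Monte Carlo draws through a hypothetical encoder and augmentation pipeline), then keep the top-scoring fraction. This is an illustrative reading, not the paper's exact procedure.

```python
# Hypothetical sketch: keep the examples whose augmentations are, in
# expectation, most similar to the augmentations of other examples.
# `encoder`, `augment`, and `n_draws` are illustrative assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def expected_augmentation_similarity(encoder, images, augment, n_draws=4):
    """Monte Carlo estimate of each example's mean cosine similarity
    between its augmented views and the augmented views of all others."""
    sims = torch.zeros(len(images))
    for _ in range(n_draws):
        views = torch.stack([augment(x) for x in images])  # one view per example
        z = F.normalize(encoder(views), dim=1)             # unit-norm embeddings
        pairwise = z @ z.T                                  # cosine similarities
        pairwise.fill_diagonal_(0.0)                        # drop self-similarity
        sims += pairwise.mean(dim=1)
    return sims / n_draws

def select_subset(encoder, images, augment, keep_fraction=0.8):
    """Keep the examples with the highest expected augmentation similarity."""
    scores = expected_augmentation_similarity(encoder, images, augment)
    k = int(keep_fraction * len(images))
    return torch.topk(scores, k).indices
```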
Better Safe than Sorry: Pre-training CLIP against Targeted Data Poisoning and Backdoor Attacks
Contrastive Language-Image Pre-training (CLIP) on large image-caption
datasets has achieved remarkable success in zero-shot classification and
enabled transferability to new domains. However, CLIP is far more
vulnerable to targeted data poisoning and backdoor attacks than
supervised learning. Perhaps surprisingly, poisoning 0.0001% of CLIP
pre-training data is enough to make targeted data poisoning attacks successful.
This is four orders of magnitude smaller than what is required to poison
supervised models. Despite this vulnerability, existing methods are very
limited in defending CLIP models during pre-training. In this work, we propose
a strong defense, SAFECLIP, to safely pre-train CLIP against targeted data
poisoning and backdoor attacks. SAFECLIP warms up the model by applying
unimodal contrastive learning (CL) on image and text modalities separately.
Then, it carefully divides the data into safe and risky subsets. SAFECLIP
trains on the risky data by applying unimodal CL to image and text modalities
separately, and trains on the safe data using the CLIP loss. By gradually
increasing the size of the safe subset during the training, SAFECLIP
effectively breaks targeted data poisoning and backdoor attacks without harming
the CLIP performance. Our extensive experiments show that SAFECLIP decreases the
attack success rate of targeted data poisoning attacks from 93.75% to 0% and
that of backdoor attacks from 100% to 0%, without harming CLIP
performance on various datasets.
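The following is a hypothetical sketch of the data-division step described above: pairs whose image and text embeddings agree most are treated as safe and trained with the CLIP loss, while the rest are trained only with unimodal contrastive losses, and the safe fraction grows over epochs. The cosine-similarity criterion, warm-up length, and growth rate are illustrative assumptions rather than the paper's exact rule.

```python
# Illustrative sketch of a safe/risky split and a growing safe fraction.
import torch
import torch.nn.functional as F

def split_safe_risky(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     safe_fraction: float):
    """Divide image-caption pairs into safe / risky indices by how well the
    two modalities' embeddings agree (assumed criterion)."""
    sims = F.cosine_similarity(image_emb, text_emb, dim=1)
    k = int(safe_fraction * sims.numel())
    safe_idx = torch.topk(sims, k).indices
    risky_mask = torch.ones(sims.numel(), dtype=torch.bool)
    risky_mask[safe_idx] = False
    risky_idx = risky_mask.nonzero(as_tuple=True)[0]
    return safe_idx, risky_idx

def safe_fraction_at(epoch: int, warmup_epochs: int = 5,
                     start: float = 0.15, growth: float = 0.01) -> float:
    """Gradually enlarge the safe subset after the unimodal warm-up
    (schedule values are placeholders)."""
    if epoch < warmup_epochs:
        return 0.0  # warm-up: both modalities trained with unimodal CL only
    return min(1.0, start + growth * (epoch - warmup_epochs))

# Usage idea: with normalized embeddings z_img, z_txt at epoch 10,
#   safe_idx, risky_idx = split_safe_risky(z_img, z_txt, safe_fraction_at(10))
# safe pairs then receive the CLIP loss, risky pairs only unimodal CL.
```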
Adversarially Robust Submodular Maximization under Knapsack Constraints
We propose the first adversarially robust algorithm for monotone submodular
maximization under single and multiple knapsack constraints with scalable
implementations in distributed and streaming settings. For a single knapsack
constraint, our algorithm outputs a robust summary of almost optimal (up to
polylogarithmic factors) size, from which a constant-factor approximation to
the optimal solution can be constructed. For multiple knapsack constraints, our
approximation is within a constant-factor of the best known non-robust
solution.
We evaluate the performance of our algorithms by comparison to natural
robustifications of existing non-robust algorithms under two objectives: 1)
dominating set for large social network graphs from Facebook and Twitter
collected by the Stanford Network Analysis Project (SNAP), 2) movie
recommendations on a dataset from MovieLens. Experimental results show that our
algorithms give the best objective for a majority of the inputs and show strong
performance even compared to offline algorithms that are given the set of
removals in advance. Comment: To appear in KDD 201
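For context, a standard (non-robust) cost-benefit greedy for monotone submodular maximization under a single knapsack constraint looks roughly as follows. This is the baseline problem setting, not the paper's robust algorithm, and the objective, costs, and budget below are illustrative.

```python
# Cost-benefit greedy baseline for a single knapsack constraint:
# repeatedly add the element with the best marginal gain per unit cost
# that still fits in the remaining budget.

def cost_benefit_greedy(f, elements, cost, budget):
    selected, spent = set(), 0.0
    base = f(selected)
    remaining = set(elements)
    while remaining:
        best, best_ratio = None, 0.0
        for e in remaining:
            if spent + cost[e] > budget:
                continue
            gain = f(selected | {e}) - base
            ratio = gain / cost[e]
            if ratio > best_ratio:
                best, best_ratio = e, ratio
        if best is None:   # nothing affordable improves the objective
            break
        selected.add(best)
        spent += cost[best]
        base = f(selected)
        remaining.discard(best)
    return selected

# Toy objective: weighted coverage (monotone submodular).
if __name__ == "__main__":
    covers = {"a": {1, 2}, "b": {2, 3, 4}, "c": {5}}
    cost = {"a": 1.0, "b": 2.0, "c": 1.0}
    f = lambda S: len(set().union(*(covers[e] for e in S))) if S else 0
    print(cost_benefit_greedy(f, covers, cost, budget=2.0))
```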
Towards Sustainable Learning: Coresets for Data-efficient Deep Learning
To improve the efficiency and sustainability of learning deep models, we
propose CREST, the first scalable framework with rigorous theoretical
guarantees to identify the most valuable examples for training non-convex
models, particularly deep networks. To guarantee convergence to a stationary
point of a non-convex function, CREST models the non-convex loss as a series of
quadratic functions and extracts a coreset for each quadratic sub-region. In
addition, to ensure faster convergence of stochastic gradient methods such as
(mini-batch) SGD, CREST iteratively extracts multiple mini-batch coresets from
larger random subsets of training data, to ensure nearly-unbiased gradients
with small variances. Finally, to further improve scalability and efficiency,
CREST identifies the examples that have already been learned and excludes them from the coreset
selection pipeline. Our extensive experiments on several deep networks trained
on vision and NLP datasets, including CIFAR-10, CIFAR-100, TinyImageNet, and
SNLI, confirm that CREST speeds up training deep networks on very large
datasets by 1.7x to 2.5x, with minimal loss in performance. By analyzing
the learning difficulty of the subsets selected by CREST, we show that deep
models benefit the most by learning from subsets of increasing difficulty
levels.
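A hypothetical sketch of the mini-batch coreset step described above: from a larger random subset, greedily pick examples whose mean per-example (e.g., last-layer) gradient stays close to the subset's mean gradient, so the resulting mini-batch yields a nearly unbiased, low-variance gradient estimate. The greedy mean-matching heuristic here is an illustrative stand-in for CREST's actual selection rule.

```python
# Illustrative mini-batch coreset selection by greedy gradient mean-matching.
import torch

def minibatch_coreset(per_example_grads: torch.Tensor, batch_size: int):
    """per_example_grads: (n, d) approximate gradients (e.g., last layer only)
    for a random subset of size n. Returns indices of a size-`batch_size`
    coreset whose mean gradient approximates the mean over all n examples."""
    assert batch_size <= per_example_grads.size(0)
    target = per_example_grads.mean(dim=0)
    selected = []
    running_sum = torch.zeros_like(target)
    candidates = set(range(per_example_grads.size(0)))
    for t in range(batch_size):
        best, best_err = None, float("inf")
        for i in candidates:
            mean_if_added = (running_sum + per_example_grads[i]) / (t + 1)
            err = torch.norm(mean_if_added - target).item()
            if err < best_err:
                best, best_err = i, err
        selected.append(best)
        running_sum += per_example_grads[best]
        candidates.discard(best)
    return torch.tensor(selected)
```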
The Final Ascent: When Bigger Models Generalize Worse on Noisy-Labeled Data
Increasing the size of overparameterized neural networks has been shown to
improve their generalization performance. However, real-world datasets often
contain a significant fraction of noisy labels, which can drastically harm the
performance of the models trained on them. In this work, we study how neural
networks' test loss changes with model size when the training set contains
noisy labels. We show that under a sufficiently large noise-to-sample size
ratio, generalization error eventually increases with model size. First, we
provide a theoretical analysis on random feature regression and show that this
phenomenon occurs as the variance of the generalization loss experiences a
second ascent under large noise-to-sample size ratio. Then, we present
extensive empirical evidence confirming that our theoretical results hold for
neural networks. Furthermore, we empirically observe that the adverse effect of
network size is more pronounced when robust training methods are employed to
learn from noisy-labeled data. Our results have important practical
implications: First, larger models should be employed with extra care,
particularly when trained on smaller datasets or using robust learning methods.
Second, a large sample size can alleviate the effect of noisy labels and allow
larger models to achieve superior performance even under noise. Comment: added more experiments and discussion on sample size
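One way to probe this empirically (our own illustrative setup, not the paper's experiments) is min-norm random feature regression with a fraction of corrupted training labels, sweeping the feature width and recording test error:

```python
# Illustrative experiment: test error of min-norm random feature regression
# versus feature width, with part of the training labels corrupted by noise.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d, noise_frac = 100, 1000, 20, 0.4

w_true = rng.normal(size=d)
def make_data(n):
    X = rng.normal(size=(n, d))
    return X, X @ w_true

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)
# Corrupt a fraction of the training labels with large noise.
noisy = rng.random(n_train) < noise_frac
y_tr_noisy = y_tr.copy()
y_tr_noisy[noisy] += rng.normal(scale=5.0, size=noisy.sum())

for width in [10, 50, 100, 200, 500, 1000, 4000]:
    W = rng.normal(size=(d, width)) / np.sqrt(d)   # fixed random features
    phi = lambda X: np.maximum(X @ W, 0.0)         # ReLU random features
    # Min-norm least squares fit on the noisy training set.
    coef, *_ = np.linalg.lstsq(phi(X_tr), y_tr_noisy, rcond=None)
    test_mse = np.mean((phi(X_te) @ coef - y_te) ** 2)
    print(f"width={width:5d}  test MSE={test_mse:.3f}")
```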
Lazier Than Lazy Greedy
Is it possible to maximize a monotone submodular function faster than the
widely used lazy greedy algorithm (also known as accelerated greedy), both in
theory and practice? In this paper, we develop the first linear-time algorithm
for maximizing a general monotone submodular function subject to a cardinality
constraint. We show that our randomized algorithm, STOCHASTIC-GREEDY, can
achieve a (1 - 1/e - ε) approximation guarantee, in expectation, to the
optimum solution in time linear in the size of the data and independent of the
cardinality constraint. We empirically demonstrate the effectiveness of our
algorithm on submodular functions arising in data summarization, including
training large-scale kernel methods, exemplar-based clustering, and sensor
placement. We observe that STOCHASTIC-GREEDY practically achieves the same
utility value as lazy greedy but runs much faster. More surprisingly, we
observe that in many practical scenarios STOCHASTIC-GREEDY does not even evaluate
all the data points once and still achieves
indistinguishable results compared to lazy greedy. Comment: In Proc. Conference on Artificial Intelligence (AAAI), 201
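A minimal sketch of the STOCHASTIC-GREEDY idea: at each of the k steps, score only a random sample of about (n/k)·log(1/ε) remaining elements and add the one with the largest marginal gain. The toy coverage objective below is for illustration only.

```python
# Illustrative STOCHASTIC-GREEDY for cardinality-constrained monotone
# submodular maximization: evaluate only a small random sample per step.
import math
import random

def stochastic_greedy(f, ground_set, k, eps=0.1, seed=0):
    rng = random.Random(seed)
    remaining = list(ground_set)
    selected = set()
    base = f(selected)
    sample_size = max(1, math.ceil(len(remaining) / k * math.log(1 / eps)))
    for _ in range(k):
        if not remaining:
            break
        candidates = rng.sample(remaining, min(sample_size, len(remaining)))
        best = max(candidates, key=lambda e: f(selected | {e}) - base)
        selected.add(best)
        base = f(selected)
        remaining.remove(best)
    return selected

# Toy usage: maximize coverage of randomly generated sets.
if __name__ == "__main__":
    covers = {i: set(random.Random(i).sample(range(50), 5)) for i in range(200)}
    f = lambda S: len(set().union(*(covers[e] for e in S))) if S else 0
    print(stochastic_greedy(f, covers.keys(), k=10))
```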